4 research outputs found
Asynchronous Training of Word Embeddings for Large Text Corpora
Word embeddings are a powerful approach for analyzing language and have been
widely popular in numerous tasks in information retrieval and text mining.
Training embeddings over huge corpora is computationally expensive because the
input is typically sequentially processed and parameters are synchronously
updated. Distributed architectures for asynchronous training that have been
proposed either focus on scaling vocabulary sizes and dimensionality or suffer
from expensive synchronization latencies.
In this paper, we propose a scalable approach to train word embeddings by
partitioning the input space instead in order to scale to massive text corpora
while not sacrificing the performance of the embeddings. Our training procedure
does not involve any parameter synchronization except a final sub-model merge
phase that typically executes in a few minutes. Our distributed training scales
seamlessly to large corpus sizes and we get comparable and sometimes even up to
45% performance improvement in a variety of NLP benchmarks using models trained
by our distributed procedure which requires of the time taken by the
baseline approach. Finally we also show that we are robust to missing words in
sub-models and are able to effectively reconstruct word representations.Comment: This paper contains 9 pages and has been accepted in the WSDM201
"Picture the scene...";
Due to the advent of social media and web 2.0, we are faced with a deluge of information; recently, research efforts have focused on filtering out noisy, irrelevant information items from social media streams and in particular have attempted to automatically identify and summarise events. However, due to the heterogeneous nature of such social media streams, these efforts have not reached fruition. In this paper, we investigate how images can be used as a source for summarising events. Existing approaches have considered only textual summaries which are often poorly written, in a different language and slow to digest. Alternatively, images are "worth 1,000 words" and are able to quickly and easily convey an idea or scene. Since images in social media can also be noisy, irrelevant and repetitive, we propose new techniques for their automatic selection, ranking and presentation. We evaluate our approach on a recently created social media event data set containing 365k tweets and 50 events, for which we extend by collecting 625k related images. By conducting two crowdsourced evaluations, we firstly show how our approach overcomes the problems of automatically collecting relevant and diverse images from noisy microblog data, before highlighting the advantages of multimedia summarisation over text based approaches